Abstract

Motivated by applications to distributed optimization over networks and large-scale
data processing in machine learning, we analyze the deterministic incremental aggregated
gradient method for minimizing a finite sum of smooth functions where the sum
is strongly convex. This method processes the functions one at a time in a deterministic
order and incorporates a memory of previous gradient values to accelerate convergence.
The method performs well in practice; however, to our knowledge, no theoretical analysis with
explicit rate results was previously given in the literature, as most of
the recent efforts concentrated on the randomized versions. In this paper, we show that
this deterministic algorithm has global linear convergence and characterize the convergence
rate. We also consider an aggregated method with momentum and demonstrate
its linear convergence. Our proofs rely on a careful choice of a Lyapunov function that
offers insight into the algorithm's behavior and simplifies the proofs considerably.


1 Introduction 

We consider the following unconstrained optimization problem where the objective function
is the sum of component functions:

\[ \min_{x \in \mathbb{R}^n} f(x) = \sum_{i=1}^m f_i(x) \tag{1} \]

where each $f_i : \mathbb{R}^n \to \mathbb{R}$ is a convex, continuously differentiable function referred to as a
component function. This problem arises in many applications including least squares problems
or more general parameter estimation problems where $f_i$ is the corresponding loss
function of the $i$-th data block [1], distributed optimization in wireless sensor networks [5],
machine learning problems [20] and minimization of the expected value of a function (where the
expectation is taken over a finite probability distribution or approximated by an $m$-sample
average).

One widely studied approach is the (deterministic) incremental gradient (IG) method,
which cycles through the component functions in a fixed cyclic order and updates the iterate
using the gradient of a single component function at a time [2]. This method can
be faster than non-incremental methods since each step is relatively cheap (one gradient
computation instead of $m$ gradient computations in the non-incremental case) and each step
makes reasonable progress on average [2]. However, IG requires the stepsize to go to zero to
obtain convergence to an optimal solution of problem (1) even if it is applied to smooth and
strongly convex component functions [5], unless a restrictive "gradient growth condition"
holds [23]. As a consequence, with a decaying stepsize, IG typically has a sublinear convergence
rate. The same observation applies both to stochastic gradient methods (which
use a random order for cycling through the component functions) [22] and to incremental
Newton methods, which are second-order methods [10].

Another interesting class of methods includes the incremental aggregated gradient (IAG)
method of Blatt et al. (see [5, 21]) and closely related stochastic methods including the
stochastic average gradient (SAG) method [20], the SAGA method [7] and the MISO method
[12]. Applications include, but are not limited to, logistic regression and binary regression
with $\ell_2$ regularization and, more recently, training conditional random fields [21]. These
methods process a single component function at a time as in incremental methods, but keep
a memory of the most recent gradients of all component functions so that an approximate
gradient descent direction of $f$ is taken at each iteration. They may require an excessive
amount of memory when $m$ is large; however, they have fast convergence properties on strongly
convex functions with a constant stepsize without requiring the restrictive gradient growth
condition. Furthermore, IAG forms an approximation to the gradient $\nabla f(x)$ of the objective
function at each step, and this provides an accurate and efficient stopping criterion (stop if the
norm of the approximate gradient is below a certain threshold), whereas it is often not clear
"when to stop" with IG.

The IAG method was first proposed in the pioneering work [5], where its global convergence
under some assumptions is shown. It is also shown that in the special case when each $f_i$ is
a quadratic, IAG exhibits global linear convergence if the stepsize is small enough; however,
neither an explicit convergence rate nor an explicit upper bound on the stepsize that can
lead to linear convergence was given. This result is based on a perturbation analysis of the
eigenvalues of a periodic dynamic linear system, which is of independent interest in terms of
the techniques used but is also highly technical and computationally demanding as it requires
estimating the derivatives of the eigenvalues of a one-parameter matrix family. Furthermore,
it only applies to quadratic functions. More recently, Tseng and Yun [23] proved global
convergence under less restrictive conditions and local linear convergence in a more general
setting when each component function satisfies a local Lipschitzian error condition, a condition
satisfied by locally strongly convex functions (around an optimal solution). Although
the results are more general than those of [5] as they apply beyond quadratics, the proofs
are still involved and do not contain any explicit rate estimates because (i) the constants
involved in the analysis are implicit and hard to compute or approximate, and (ii) the results are
asymptotic (they hold when the stepsize is small enough but bounds on the stepsize are not
available). See Remark 3.4 for more detail.

In this paper, we present a novel convergence analysis for the IAG method with several
advantages and implications. First, our analysis is based on a careful choice of a Lyapunov
function which leads to simple global and linear convergence proofs. Furthermore, our proofs
give more insight into the behavior of IAG compared to previous approaches, showing that
IAG can be treated as a perturbed gradient descent method where gradient errors can be
interpreted as shocks with a finite duration that fade away as one gets closer to an
optimal solution (in a way we will make precise). Second, to our knowledge, our analysis
is the first to provide explicit rate estimates for (deterministic) IAG methods. Third, we
discuss an "IAG method with momentum" and show its global and linear convergence. To our
knowledge, this is the first global convergence and linear convergence result for an aggregated
gradient method with momentum.

In many applications, there is a favorable deterministic order in which to process the functions.
For instance, in source localization or parameter estimation problems over sensor networks,
sensors are part of a large network structure and are only able to communicate with their
neighbors subject to certain constraints in terms of geography and distance, and this may
force the method to follow a particular deterministic order. There exist other similar
scenarios where the data is distributed over different units that are connected to each other
in a particular fashion (e.g. connected through a ring network) where a local optimization
algorithm that accesses each unit in a particular order is favorable. These applications
motivate the study of deterministic incremental methods such as IAG, which performs well
in practice [5].

There has been some recent work on the SAG algorithm, the stochastic version of the
IAG method where the order is random. For SAG, Le Roux et al. [20] and later Defazio
et al. [7] established a global linear convergence rate in expectation (of the expected cost), which is
a weaker, averaged sense of convergence compared to the deterministic convergence we
consider in this work. In particular, our results apply to any order (as long as each function
is visited at least once in a finite number of steps $K$) and hold deterministically (not in a
probabilistic sense). As the performance of deterministic incremental methods is sensitive
to the specific order chosen (see e.g. [Example 1.5.6], [Figure 2.1.9]), IAG can be slower
than SAG if an unfavorable order is chosen, and the analysis of IAG has to account for such
worst-case scenarios. This is also reflected in Theorem 3.3, where the rate of IAG has a worse
(quadratic) dependence on the condition number $Q$ (see (6) for a definition) whereas SAG has
a linear dependence [20, Prop 1]. We also note that most of the proofs and proof techniques
used in the stochastic setting, such as the fact that the expected gradient error is zero, do not
apply to the deterministic setting, and this requires a new approach for analyzing IAG.

In the next section, Section 2, we describe the IAG method. Section 3 introduces the
main assumptions and estimates for our convergence analysis and the linear rate result. In
Section 4, we develop a new IAG method with momentum and provide a linear rate result
for it. In Section 5, we conclude with a summary and a discussion of future work.








Notation. We use $\|\cdot\|$ to denote the standard Euclidean norm on $\mathbb{R}^n$. For a real scalar $x$,
we define $(x)_+ = \max(x, 0)$. The gradient and the Hessian matrix of $f$ at a point $x \in \mathbb{R}^n$
are denoted by $\nabla f(x)$ and $\nabla^2 f(x)$ respectively. The Euclidean (dot) inner product of two
vectors $v_1, v_2 \in \mathbb{R}^n$ is denoted by $\langle v_1, v_2 \rangle$.


2 IAG method


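For a constant stepsize $\gamma > 0$, an integer $K \ge 0$ and arbitrary initial points
$x^0, x^{-1}, \dots, x^{-K} \in \mathbb{R}^n$, the IAG method consists of the following iterations,

\[ g^k = \sum_{i=1}^m \nabla f_i(x^{\tau_i^k}), \tag{2} \]

\[ x^{k+1} = x^k - \gamma g^k, \qquad k = 0, 1, 2, \dots, \tag{3} \]

where the gradient sampling times $\{\tau_i^k\}$ can be arbitrary as long as each gradient is sampled at
least once in the last $K$ iterations, i.e.

\[ k \ge \tau_i^k \ge k - K, \qquad i = 1, 2, \dots, m. \tag{4} \]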


In other words, $K$ is an upper bound on the delay encountered by the gradients of the component
functions. The update (2) determines the direction of motion $-g^k$ by approximating the
steepest descent direction $-\nabla f(x^k)$ at (iterate) time $k$ from the recently computed gradients
of the component functions (at times $\tau_i^k$). For example, if the component functions are
processed one by one using a deterministic cyclic order on the index set $\{1, 2, \dots, m\}$ with
initialization $\tau_i^0 = 0$ for all $i$, then $\tau_i^k$ admits the recursion

\[ \tau_i^k = \begin{cases} k, & \text{if } i = ((k-1) \bmod m) + 1, \\ \tau_i^{k-1}, & \text{else}, \end{cases} \qquad 1 \le i \le m, \quad k = 1, 2, \dots, \]

which satisfies $\tau_i^k \ge k - (m-1)$ for all $i, k$, so that $K = m - 1$. This is the original IAG
method introduced by Blatt et al. [5]. Later on, Tseng and Yun [23] generalized this method
by allowing more general gradient sampling times $\{\tau_i^k\}$ with bounded delays, i.e. satisfying (4).
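
The cyclic recursion for the sampling times can be tabulated directly. The sketch below is only an illustration: the function name `cyclic_sampling_times` and the zero-based list layout are assumptions of this example, not notation from the method itself. It records $\tau_i^k$ for every $k$ and checks the delay bound (4) with $K = m - 1$.

```python
def cyclic_sampling_times(m, num_steps):
    """Return history[k][i-1] = tau_i^k for k = 0, ..., num_steps (cyclic order)."""
    tau = [0] * m                      # initialization tau_i^0 = 0 for all i
    history = [list(tau)]
    for k in range(1, num_steps + 1):
        i = (k - 1) % m                # zero-based index of component ((k-1) mod m) + 1
        tau[i] = k                     # only this component is refreshed at time k
        history.append(list(tau))
    return history


if __name__ == "__main__":
    m = 4
    hist = cyclic_sampling_times(m, 10)
    for k, taus in enumerate(hist):
        # Delay bound (4): k >= tau_i^k >= k - K with K = m - 1.
        assert all(k - (m - 1) <= t <= k for t in taus)
    print(hist[5])                     # [5, 2, 3, 4]
```

With $m = 4$, the stalest gradient at time $k = 5$ is the one sampled at time 2, which is exactly $k - (m - 1)$.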


As IAG takes an approximate steepest descent direction, it is natural to analyze
it as a perturbed steepest descent method. In fact, in the special case when $K = 0$,
all the gradients are up-to-date and IAG reduces to the classical, well-known (non-incremental)
gradient descent (GD) method. Therefore, the more interesting case that we will
analyze is when $K$ is strictly positive. For simplicity of notation in our analysis, we will also
take

\[ x^0 = x^{-1} = \dots = x^{-K}. \tag{5} \]

This results in the initialization $\tau_i^0 = 0$ for all $i$. However, it will be clear that our analysis
can be extended to other (arbitrary) choices of initial points in a straightforward
manner.
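
The iterations (2)-(3) with the cyclic order can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the name `iag_cyclic`, the use of scalar iterates, and the test problem $f_i(x) = \frac{1}{2} a_i (x - c_i)^2$ are all choices made for this example.

```python
def iag_cyclic(grads, x0, gamma, num_steps):
    """Run IAG (2)-(3) with cyclic sampling; returns x^0, ..., x^num_steps.

    grads[i](x) is the gradient of the (i+1)-th component function.
    Scalar iterates are used to keep the sketch short (an assumption).
    """
    m = len(grads)
    memory = [g(x0) for g in grads]    # stored gradients, all taken at x^0, cf. (5)
    aggregate = sum(memory)            # g^0
    x = x0
    iterates = [x]
    for k in range(num_steps):
        if k >= 1:
            i = (k - 1) % m            # component refreshed at time k
            fresh = grads[i](x)        # its gradient at the current iterate x^k
            aggregate += fresh - memory[i]
            memory[i] = fresh
        x = x - gamma * aggregate      # update (3)
        iterates.append(x)
    return iterates


if __name__ == "__main__":
    a, c = [1.0, 2.0, 3.0], [0.0, 1.0, 2.0]    # f_i(x) = 0.5 * a_i * (x - c_i)**2
    grads = [lambda x, ai=ai, ci=ci: ai * (x - ci) for ai, ci in zip(a, c)]
    xs = iag_cyclic(grads, x0=0.0, gamma=0.005, num_steps=3000)
    print(abs(xs[-1] - 4.0 / 3.0))     # distance to the minimizer x* = 4/3
```

Keeping a running aggregate makes each iteration cost one component gradient plus a constant amount of extra work, at the price of storing $m$ gradients.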
3 Convergence analysis

3.1 Preliminaries 

We will make the following assumptions, which have appeared in a number of papers analyzing
incremental methods, including [20].

Assumption 3.1. (Strong convexity and Lipschitz gradients)

(i) Each $f_i$ is $L_i$-smooth, meaning it has Lipschitz continuous gradients on $\mathbb{R}^n$ satisfying

\[ \|\nabla f_i(y) - \nabla f_i(z)\| \le L_i \|y - z\|, \qquad \forall y, z \in \mathbb{R}^n, \]

where $L_i \ge 0$ is the Lipschitz constant, $i = 1, 2, \dots, m$. Let $L = \sum_{i=1}^m L_i$.

(ii) The function $f$ is strongly convex on $\mathbb{R}^n$ with parameter $\mu > 0$, meaning that the function
$x \mapsto f(x) - \frac{\mu}{2}\|x\|^2$ is convex.

It follows by the triangle inequality that $f$ has Lipschitz continuous gradients with Lipschitz
constant $L$.^1 We define the condition number of $f$ as

\[ Q = \frac{L}{\mu} \ge 1 \tag{6} \]

(see e.g. [16]). As an example, in the special case when each $f_i$ is a quadratic function, $\nabla^2 f_i(x)$
is a constant matrix for each $i$ and we can take $L_i$ to be its largest eigenvalue, whereas the
strong convexity constant $\mu$ can be taken as the smallest eigenvalue of the Hessian of $f$.

A consequence of Assumption 3.1 on the strong convexity of $f$ is that there exists a
unique optimal solution of problem (1), which we denote by $x^*$. In addition to the strong
convexity, by the gradient Lipschitzness assumption, we have

\[ \|\nabla f(x)\| \le L \|x - x^*\|, \qquad \forall x \in \mathbb{R}^n, \tag{7} \]

\[ f(x) - f(x^*) \le \frac{L}{2} \|x - x^*\|^2, \qquad \forall x \in \mathbb{R}^n, \tag{8} \]

(see [16, Theorem 2.1.5]). In addition, it follows from [16, Theorem 2.1.12] that

\[ \langle \nabla f(x), x - x^* \rangle \ge \frac{\mu L}{\mu + L} \|x - x^*\|^2 + \frac{1}{\mu + L} \|\nabla f(x)\|^2, \qquad \forall x \in \mathbb{R}^n. \]

We finally introduce the following lemma for proving the linear rate for IAG. We omit the
proof due to space considerations; for a proof see Feyzmahdavian et al. [8], where this lemma is
used to analyze the effect of delays in a first-order method. Roughly speaking, the intuition
behind this lemma is the following: if a non-negative sequence $\{V_k\}$ that decays to zero
linearly, obeying $V_{k+1} \le p V_k$ for some $p < 1$, is perturbed with an additive (noise) shock
term that depends on the recent history, i.e. the shock at step $k$ is on the order of $V_\ell$ where
$\ell \in [k - d(k), k]$ and $d(k)$ is the time interval (duration) of the shock, the linear convergence
property can be preserved if the shocks are small enough, but this comes at the expense of a
degraded rate $r > p$ which is determined by the amplitude of the shocks (controlled by the
parameter $q$) and the duration of the shocks.

^1 The Lipschitz constant $L = \sum_{i=1}^m L_i$ of $\nabla f$ may not be the best Lipschitz constant, as the Lipschitz
constants $L_i$ of $\nabla f_i$ are subadditive.

Lemma 3.2. Let $\{V_k\}$ be a sequence of non-negative real numbers satisfying

\[ V_{k+1} \le p V_k + q \max_{(k - d(k))_+ \le \ell \le k} V_\ell, \qquad k \ge 0, \]

for some non-negative constants $p$ and $q$. If $p + q < 1$ and $0 \le d(k) \le d_{\max}$ for some positive
constant $d_{\max}$, then

\[ V_k \le r^k V_0, \qquad k \ge 1, \]

where $r = (p + q)^{\frac{1}{1 + d_{\max}}}$.
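
As a numerical illustration of Lemma 3.2, the sketch below generates the sequence that satisfies the recursion with equality and a constant delay $d(k) = d_{\max}$, and compares it with the bound $r^k V_0$. The helper names `delayed_recursion` and `lemma_rate` are hypothetical, and the equality case is simply one convenient instance of the hypothesis.

```python
def delayed_recursion(p, q, d_max, v0, num_steps):
    """Generate V_{k+1} = p*V_k + q*max(V_l : (k-d_max)_+ <= l <= k)."""
    V = [v0]
    for k in range(num_steps):
        window = V[max(k - d_max, 0): k + 1]   # recent history seen by the shock
        V.append(p * V[k] + q * max(window))
    return V


def lemma_rate(p, q, d_max):
    """Rate r = (p + q)^(1 / (1 + d_max)) from Lemma 3.2."""
    return (p + q) ** (1.0 / (1 + d_max))


if __name__ == "__main__":
    p, q, d_max = 0.5, 0.3, 3
    V = delayed_recursion(p, q, d_max, v0=1.0, num_steps=50)
    r = lemma_rate(p, q, d_max)
    # The lemma predicts V_k <= r^k * V_0 for every k >= 1.
    print(all(V[k] <= r ** k * V[0] * (1 + 1e-9) for k in range(1, len(V))))
```

Note that $r \ge p + q > p$ here, which reflects the degraded rate caused by the delayed shocks.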


3.2 Bounding gradient error 

We denote the distance to the optimal solution at iterate $k$ by

\[ \mathrm{dist}_k = \|x^k - x^*\| \tag{9} \]

and the gradient error by

\[ e^k = g^k - \nabla f(x^k). \tag{10} \]

We will show that the gradient error can be bounded in terms of a finite sum involving distances
of iterates to the optimal solution. Using the triangle inequality and the Lipschitzness
of the gradients, for any $k \ge 0$,

\[ \|e^k\| \le \sum_{i=1}^m \|\nabla f_i(x^{\tau_i^k}) - \nabla f_i(x^k)\| \le \sum_{i=1}^m L_i \|x^{\tau_i^k} - x^k\|. \tag{11} \]

As the gradient delays are bounded by $K$ (see (4)), by a repeated application of the triangle
inequality, we obtain for any $k \ge 0$,

\begin{align}
\|e^k\| &\le \sum_{i=1}^m L_i \sum_{j=\tau_i^k}^{k-1} \|x^{j+1} - x^j\| \le L \sum_{j=(k-K)_+}^{k-1} \|x^{j+1} - x^j\| \tag{12} \\
&= \gamma L \sum_{j=(k-K)_+}^{k-1} \|g^j\| \le \gamma L \sum_{j=(k-K)_+}^{k-1} \big( \|\nabla f(x^j)\| + \|e^j\| \big) \tag{13}
\end{align}

(with the convention that $\|e^0\| = 0$, which is implied by (5)). The inequality (13) provides
a recursive upper bound for the gradient error that relates the gradient error $\|e^k\|$ to the
previous gradient errors $\|e^j\|$ with $j < k$. Bounding the $\|e^j\|$ term in (13) with the previously
obtained bound (11) on the gradient error and by the triangle inequality,

\begin{align}
\|e^k\| &\le \gamma L \sum_{j=(k-K)_+}^{k-1} \Big( \|\nabla f(x^j)\| + \sum_{i=1}^m L_i \|x^{\tau_i^j} - x^j\| \Big) \tag{14} \\
&\le \gamma L \sum_{j=(k-K)_+}^{k-1} \Big( \|\nabla f(x^j)\| + \sum_{i=1}^m L_i \big( \|x^{\tau_i^j} - x^*\| + \|x^j - x^*\| \big) \Big) \notag \\
&= \gamma L \sum_{j=(k-K)_+}^{k-1} \Big( \|\nabla f(x^j)\| + \sum_{i=1}^m L_i \big( \mathrm{dist}_{\tau_i^j} + \mathrm{dist}_j \big) \Big). \notag
\end{align}




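Invoking (4) on the boundedness of the gradient delays by $K$ once more and using (7) on
gradient Lipschitzness to bound the norm of the gradient, we finally get

\begin{align}
\|e^k\| &\le \gamma L \sum_{j=(k-K)_+}^{k-1} \Big( \|\nabla f(x^j)\| + 2L \max_{(j-K)_+ \le \ell \le j} \mathrm{dist}_\ell \Big) \notag \\
&\le \gamma L \sum_{j=(k-K)_+}^{k-1} \Big( L\, \mathrm{dist}_j + 2L \max_{(j-K)_+ \le \ell \le j} \mathrm{dist}_\ell \Big) \notag \\
&\le \gamma L \sum_{j=(k-K)_+}^{k-1} \Big( 3L \max_{(j-K)_+ \le \ell \le j} \mathrm{dist}_\ell \Big) \le 3 \gamma L^2 K \max_{(k-2K)_+ \le \ell \le k-1} \mathrm{dist}_\ell. \tag{15}
\end{align}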
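
The bound (15) can be checked numerically along an IAG run. The sketch below assumes scalar components $f_i(x) = \frac{1}{2} a_i (x - c_i)^2$ (so $L_i = a_i$ and $x^* = \sum_i a_i c_i / \sum_i a_i$) and the cyclic order with $K = m - 1$; the function name `check_error_bound` and the small floating-point slack are choices made for this example only.

```python
def check_error_bound(a, c, gamma, num_steps):
    """Check (15) along a cyclic IAG run on f_i(x) = 0.5*a[i]*(x - c[i])**2."""
    m, L = len(a), sum(a)              # L_i = a[i], hence L = sum of the a[i]
    K = m - 1                          # delay bound of the cyclic order (needs m >= 2)
    x_star = sum(ai * ci for ai, ci in zip(a, c)) / L
    x = 0.0
    memory = [ai * (x - ci) for ai, ci in zip(a, c)]
    dists = [abs(x - x_star)]          # dists[k] = dist_k
    for k in range(num_steps):
        if k >= 1:
            i = (k - 1) % m
            memory[i] = a[i] * (x - c[i])                # refresh one gradient
            err = abs(sum(memory) - L * (x - x_star))    # |e^k| = |g^k - grad f(x^k)|
            window = dists[max(k - 2 * K, 0): k]         # dist_l, (k-2K)_+ <= l <= k-1
            if err > 3 * gamma * L ** 2 * K * max(window) + 1e-12:
                return False
        x -= gamma * sum(memory)       # IAG update (3)
        dists.append(abs(x - x_star))
    return True


if __name__ == "__main__":
    print(check_error_bound([1.0, 2.0, 3.0], [0.0, 1.0, 2.0], gamma=0.005, num_steps=300))
```

Since the right-hand side of (15) scales with the recent distances, the check also shows the error shrinking as the iterates approach $x^*$.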


3.3 Linear convergence analysis 

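From the IAG update formulae (2)-(3) and the definition (10) of the gradient error, it follows
directly by taking norm squares that

\[ \mathrm{dist}_{k+1}^2 = \mathrm{dist}_k^2 - 2\gamma \langle \nabla f(x^k), x^k - x^* \rangle + \gamma^2 \|\nabla f(x^k)\|^2 + E_k, \tag{16} \]

where the gradient errors are encapsulated by the last term

\[ E_k = \gamma^2 \|e^k\|^2 - 2\gamma \langle x^k - x^* - \gamma \nabla f(x^k), e^k \rangle. \tag{17} \]

Using the inequality from [16, Theorem 2.1.12] for strongly convex functions stated in Section 3.1,

\begin{align}
\mathrm{dist}_{k+1}^2 &\le \Big( 1 - 2\gamma \frac{\mu L}{\mu + L} \Big) \mathrm{dist}_k^2 + \gamma \Big( \gamma - \frac{2}{\mu + L} \Big) \|\nabla f(x^k)\|^2 + E_k \tag{18} \\
&\le \Big( 1 - 2\gamma \frac{\mu L}{\mu + L} \Big) \mathrm{dist}_k^2 + E_k, \qquad \text{if } \gamma \le \frac{2}{\mu + L}. \tag{19}
\end{align}

Note that when $K = 0$, IAG reduces to the GD method where $e^k = 0$ and the $E_k$ term
vanishes. In this special case, the last inequality simplifies to

\[ \|x^k - x^* - \gamma \nabla f(x^k)\|^2 \le \Big( 1 - 2\gamma \frac{\mu L}{\mu + L} \Big) \mathrm{dist}_k^2 \le \mathrm{dist}_k^2 \tag{20} \]

and the choice $\gamma = \frac{2}{\mu + L}$ leads to global linear convergence satisfying

\[ \mathrm{dist}_{k+1} = \Big\| x^k - x^* - \frac{2}{\mu + L} \nabla f(x^k) \Big\| \le \frac{Q - 1}{Q + 1}\, \mathrm{dist}_k \le \Big( \frac{Q - 1}{Q + 1} \Big)^{k+1} \mathrm{dist}_0. \]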

This is the standard analysis of the GD method (see [16, Theorem 2.1.15]). The next theorem
shows that in the more interesting general case when there are gradient errors, i.e. when
$K > 0$ and $e^k \ne 0$, a similar linear convergence argument can be made, although one has to
use a smaller stepsize to compensate for the gradient errors. The main idea is to eliminate the
gradient error $\|e^k\|$ terms in (19) by replacing them with terms involving only distances. This
can be done by invoking (15), which essentially provides an upper bound for the gradient errors
in terms of distances. Then, Lemma 3.2 with $V_k = \mathrm{dist}_k^2$ applies and provides the convergence
rate.
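
For the error-free case $K = 0$, the contraction factor $(Q-1)/(Q+1)$ of GD with stepsize $2/(\mu + L)$ is easy to observe on a small quadratic. The sketch below assumes the two-dimensional test function $f(x) = \frac{1}{2}(\mu x_1^2 + L x_2^2)$ with minimizer $x^* = 0$; the function name `gd_distances` is hypothetical.

```python
import math


def gd_distances(mu, L, x0, num_steps):
    """GD with stepsize 2/(mu+L) on f(x) = 0.5*(mu*x[0]**2 + L*x[1]**2); x* = (0, 0)."""
    gamma = 2.0 / (mu + L)
    x = list(x0)
    dists = [math.hypot(x[0], x[1])]           # dist_0
    for _ in range(num_steps):
        grad = [mu * x[0], L * x[1]]           # gradient of the quadratic
        x = [x[0] - gamma * grad[0], x[1] - gamma * grad[1]]
        dists.append(math.hypot(x[0], x[1]))
    return dists


if __name__ == "__main__":
    mu, L = 1.0, 10.0
    factor = (L / mu - 1.0) / (L / mu + 1.0)   # (Q - 1) / (Q + 1)
    d = gd_distances(mu, L, x0=(1.0, 1.0), num_steps=20)
    print(all(d[k + 1] <= factor * d[k] * (1 + 1e-9) for k in range(20)))
```

On this particular quadratic both coordinates shrink by exactly $(Q-1)/(Q+1)$ in absolute value, so the bound is attained up to rounding.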


Theorem 3.3. Suppose that Assumption 3.1 holds. Consider the IAG iterations (2)-(5)
with an integer gradient delay parameter $K > 0$ and a constant stepsize $0 < \gamma < \bar{\gamma}$ where

\[ \bar{\gamma} = \frac{a \mu}{K L} \, \frac{1}{\mu + L} \]

with $a = 8/25$. Then, the IAG iterates $\{x^k\}$ are globally linearly convergent. Furthermore, when
$\gamma = \gamma_* := \bar{\gamma}/2$ we have for $k = 1, 2, \dots$,

\[ \|x^k - x^*\| \le \Big( 1 - \frac{c_K}{(Q + 1)^2} \Big)^{k} \|x^0 - x^*\|, \tag{21} \]

\[ f(x^k) - f(x^*) \le \frac{L}{2} \Big( 1 - \frac{c_K}{(Q + 1)^2} \Big)^{2k} \|x^0 - x^*\|^2, \tag{22} \]

where $c_K = \frac{2}{25} [K(2K + 1)]^{-1}$.
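
The constants of Theorem 3.3 are explicit, so they can be evaluated directly for given $\mu$, $L$ and $K$. The helper below is a small sketch; the name `iag_constants` and the return layout are assumptions of this example.

```python
def iag_constants(mu, L, K):
    """Stepsize bound, stepsize gamma_* and rate factor from Theorem 3.3 (K >= 1)."""
    Q = L / mu                                   # condition number (6)
    gamma_bar = (8.0 / 25.0) * mu / (K * L * (mu + L))
    gamma_star = gamma_bar / 2.0
    c_K = (2.0 / 25.0) / (K * (2 * K + 1))
    rate = 1.0 - c_K / (Q + 1.0) ** 2            # per-iteration factor in (21)
    return gamma_bar, gamma_star, rate


if __name__ == "__main__":
    gamma_bar, gamma_star, rate = iag_constants(mu=1.0, L=10.0, K=4)
    print(gamma_bar, gamma_star, rate)
```

The factor `rate` is close to one for large $K$ or $Q$, which reflects the worst-case nature of the bound over all admissible orders.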


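Proof. As $K > 0$, there are gradient delays and the error term $E_k$ defined by (17) that
appears in the evolution (16) of the iterates is non-zero in general. Therefore, the convergence
rate is limited by how fast this error term decays. Assume $\gamma \le \frac{2}{\mu + L}$ so that (19) is applicable.
Using the triangle and Cauchy-Schwarz inequalities on $E_k$, we see that

\[ |E_k| \le \gamma^2 \|e^k\|^2 + 2\gamma \|e^k\| \, \|x^k - x^* - \gamma \nabla f(x^k)\| \le \gamma^2 \|e^k\|^2 + 2\gamma \|e^k\| \, \mathrm{dist}_k, \]

where we used (20) in the last step. Using (15) on the gradient error, we obtain

\begin{align*}
|E_k| &\le 9 \gamma^4 L^4 K^2 \max_{(k-2K)_+ \le \ell \le k-1} \mathrm{dist}_\ell^2 + 6 \gamma^2 L^2 K \max_{(k-2K)_+ \le \ell \le k-1} \mathrm{dist}_\ell \, \mathrm{dist}_k \\
&\le \big( 9 \gamma^4 L^4 K^2 + 6 \gamma^2 L^2 K \big) \max_{(k-2K)_+ \le \ell \le k} \mathrm{dist}_\ell^2.
\end{align*}

Plugging this bound into the recursive inequality (19) for distances leads to

\[ \mathrm{dist}_{k+1}^2 \le p(\gamma)\, \mathrm{dist}_k^2 + q(\gamma) \max_{(k-2K)_+ \le \ell \le k} \mathrm{dist}_\ell^2 \]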











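with

\[ p(\gamma) = 1 - 2\gamma \frac{\mu L}{\mu + L}, \qquad q(\gamma) = 9 \gamma^4 L^4 K^2 + 6 \gamma^2 L^2 K. \]

If the stepsize is small enough to satisfy the condition $s(\gamma) := p(\gamma) + q(\gamma) < 1$, then by Lemma
3.2 applied with $V_k = \mathrm{dist}_k^2$, IAG is globally linearly convergent.^2 Ignoring the positive $O(\gamma^4)$
term in $q(\gamma)$, this condition would require at least

\[ 1 - 2\gamma \frac{\mu L}{\mu + L} + 6 \gamma^2 L^2 K < 1 \iff 0 < \gamma < \frac{\mu}{3 L K} \, \frac{1}{\mu + L} = \frac{25}{24} \bar{\gamma}, \tag{23} \]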


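which would imply

\[ \gamma^2 L^2 K < \frac{1}{9 K (Q + 1)^2} \le \frac{1}{36} \tag{24} \]

as both $K \ge 1$ and $Q \ge 1$. We will show that under the slightly more restrictive condition
$0 < \gamma < \bar{\gamma}$, one can also handle the $O(\gamma^4)$ term and guarantee $s(\gamma) < 1$. So assume
$0 < \gamma < \bar{\gamma}$. The condition (23) holds and the inequality (24) is valid. This implies that

\[ q(\gamma) = 3 \gamma^2 L^2 K \big( 2 + 3 \gamma^2 L^2 K \big) \le 3 \gamma^2 L^2 K \Big( 2 + \frac{1}{12} \Big) \le \frac{25}{4} \gamma^2 L^2 K \]

and after straightforward calculations that

\[ s(\gamma) := p(\gamma) + q(\gamma) \le 1 - 2\gamma \frac{\mu L}{\mu + L} + \frac{25}{4} \gamma^2 L^2 K < 1. \tag{25} \]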

Then by Lemma 3.2 we have global linear convergence of the sequence $V_k = \mathrm{dist}_k^2$ to zero with
rate $\rho(\gamma) = s(\gamma)^{1/(2K+1)} < 1$. This shows the global linear convergence of IAG. It remains
to show the claimed convergence rate for $\gamma = \gamma_*$. Let $\gamma = \gamma_*$. Note that this choice minimizes the
quadratic in $\gamma$ on the right-hand side of (25), leading to

\[ s(\gamma_*) \le 1 - \frac{4 \mu^2}{25 K (\mu + L)^2} = 1 - \frac{4}{25 K (Q + 1)^2}. \]

Then, as the linear convergence rate of the sequence $\{\mathrm{dist}_k^2\}$ is $\rho(\gamma_*) < 1$, by taking square
roots, the sequence $\{\mathrm{dist}_k\}$ is linearly convergent with rate $r_* = \rho(\gamma_*)^{1/2}$ satisfying

\[ \mathrm{dist}_k \le r_*^k \, \mathrm{dist}_0, \qquad r_* = s(\gamma_*)^{\frac{1}{2(2K+1)}} \le 1 - \frac{2}{25 K (2K + 1)(Q + 1)^2} = 1 - \frac{c_K}{(Q + 1)^2}, \]

where we used the bound on $s(\gamma_*)$ above and the inequality $(1 - x)^a \le 1 - ax$ for $x, a \in [0, 1]$ to get an upper
bound for $r_*$. This proves the rate (21). Then, (22) follows directly from (8) and (21). □

^2 Note that when $K = 0$, the inequality $s(\gamma) < 1$ is linear in $\gamma$, whereas when $K > 0$ it is fourth order in
$\gamma$ and is more restrictive.
Remark 3.4. (Comparison with previous results) In a related work, Tseng and Yun
show that there exists a positive constant $C$ such that, if the stepsize $\gamma$ is small enough, IAG has
a $K$-step linear convergence rate of $\sqrt{1 - C\gamma}$ under a Lipschitzian error assumption, which is
a strong-convexity-like condition [23, Theorem 6.1]. However, in their analysis, there is no
explicit rate estimate, as (i) the results are asymptotic, holding for $\gamma$ small enough without giving
a precise interval for $\gamma$, and (ii) the constant $C$ is implicit and hard to compute or approximate as it
depends on several other implicit constants and a Lipschitzian error parameter $\tau$. Our analysis
is not only simple, using basic distance inequalities, but its constants are also transparent
and explicit.

Remark 3.5. (IAG versus IG) Theorem 3.3 shows that IAG with constant stepsize is
globally linearly convergent; however, the same is not true for IG. In fact, IG with constant
stepsize is linearly convergent to a neighborhood of the solution but does not in general converge
to the optimal solution due to the existence of gradient errors that are typically bounded
away from zero. As a consequence, achieving global convergence with IG requires
a stepsize that goes to zero, and this typically results in slow convergence. In contrast, the
gradient error in IAG is controlled by the distance of recent iterates to the optimal solution;
therefore it is attenuated as the iterates get closer to the optimal solution and a diminishing
stepsize is not needed to control the error.
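
The contrast drawn in Remark 3.5 can be observed on a toy problem. The sketch below assumes scalar components $f_i(x) = \frac{1}{2} a_i (x - c_i)^2$, whose minimizer is $x^* = \sum_i a_i c_i / \sum_i a_i$, and runs both methods with the same constant stepsize; the function names are hypothetical.

```python
def ig_constant_step(a, c, gamma, num_cycles, x0=0.0):
    """IG with constant stepsize on f_i(x) = 0.5*a[i]*(x - c[i])**2; returns final x."""
    x = x0
    for _ in range(num_cycles):
        for ai, ci in zip(a, c):
            x -= gamma * ai * (x - ci)          # step along one component gradient
    return x


def iag_constant_step(a, c, gamma, num_steps, x0=0.0):
    """Cyclic IAG with constant stepsize on the same problem; returns final x."""
    m = len(a)
    memory = [ai * (x0 - ci) for ai, ci in zip(a, c)]
    x = x0
    for k in range(num_steps):
        if k >= 1:
            i = (k - 1) % m
            memory[i] = a[i] * (x - c[i])       # refresh one stored gradient
        x -= gamma * sum(memory)                # step along the aggregated gradient
    return x


if __name__ == "__main__":
    a, c = [1.0, 2.0, 3.0], [0.0, 1.0, 2.0]
    x_star = 4.0 / 3.0
    print(abs(ig_constant_step(a, c, 0.01, 1000) - x_star))    # stalls away from zero
    print(abs(iag_constant_step(a, c, 0.01, 3000) - x_star))   # decays to rounding level
```

Both runs use 3000 component gradient evaluations after initialization, so the comparison is on equal gradient budget in this example.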

Remark 3.6. (Local strong convexity implies local linear rate) We note that when $f$
is not globally strongly convex but only locally strongly convex around a stationary point, for
instance when the Hessian is not degenerate around a locally optimal solution, by reasoning
along the lines of the proof of Theorem 3.3, it is possible to show that IAG is locally linearly
convergent.

4 IAG with momentum

An important variant of the GD method is the heavy-ball method [18], which extrapolates
the direction implied by the previous two iterates by the following update rule:

\[ x^{k+1} = x^k - \gamma \nabla f(x^k) + \beta (x^k - x^{k-1}) \]

where $\beta \ge 0$ is the momentum parameter. It can be shown that the heavy-ball method can
achieve a faster local convergence than GD when $\beta$ is in a certain range [18, Section 3.1].
There has also been much interest in understanding its global convergence properties.
Accelerated gradient methods introduced by Nesterov can also be thought of as
momentum methods where the momentum parameter is variable and appropriately chosen.
There has been a lot of recent interest in these accelerated methods as they have optimal
iteration complexity properties under some conditions [16].

In contrast to the recent advances in non-incremental methods with momentum, there has
been less progress on incremental methods with momentum. In particular, no deterministic
incremental methods with favorable convergence characteristics similar to those of accelerated
gradient methods are currently known. However, there is the IG method with momentum
which consists of the inner iterations

\[ x_{i+1}^k = x_i^k - \gamma_k \nabla f_i(x_i^k) + \beta (x_i^k - x_{i-1}^k), \qquad i = 1, 2, \dots, m, \]

starting from $x_1^k \in \mathbb{R}^n$ with the convention that $x_1^{k+1} = x_{m+1}^k$, where $\gamma_k$ is the stepsize
[11, 13, 25]. This method can be faster than IG on some problems, especially when gradients
have oscillatory behavior; however, it would still require the stepsize to go to zero due to gradient
errors, leading to typical sublinear convergence [22]. It is natural to ask whether IAG with
such an additional momentum term, which we abbreviate by IAG-M,

\[ x^{k+1} = x^k - \gamma g^k + \beta (x^k - x^{k-1}), \qquad k = 0, 1, 2, \dots, \tag{27} \]

would be globally convergent for $\beta$ in some range $(0, \bar{\beta})$. We expect that this algorithm
can outperform IAG in problems where the individual gradients show oscillatory behavior
because the momentum term provides an extra smoothing/averaging effect on the iterates.
The global linear convergence of the IAG-M method for a certain range of $\beta$ values can be
shown by reasoning along the lines presented in Section 3. Most of the logic in the
derivation of the inequalities (11)-(13) applies, with the only difference that the $\|x^{j+1} - x^j\|$
terms will now contain an additional momentum term due to the modified update rule (27).
We nevertheless provide a sketch of the proof in Appendix A for the sake of completeness.
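
The IAG-M update (27) differs from IAG only by the momentum term, which the following sketch makes explicit. It assumes the cyclic order, scalar iterates and the initialization $x^{-1} = x^0$ from (5); the function name `iag_momentum` and the test problem are illustrative choices, and the parameter values in the usage lines are not tuned recommendations.

```python
def iag_momentum(grads, x0, gamma, beta, num_steps):
    """Run IAG-M (27) with cyclic sampling on scalar iterates; returns the final x."""
    m = len(grads)
    memory = [g(x0) for g in grads]    # stored component gradients, all at x^0
    aggregate = sum(memory)            # g^0
    x_prev, x = x0, x0                 # x^{-1} = x^0, cf. (5)
    for k in range(num_steps):
        if k >= 1:
            i = (k - 1) % m            # component refreshed at time k
            fresh = grads[i](x)
            aggregate += fresh - memory[i]
            memory[i] = fresh
        # Update (27): aggregated gradient step plus momentum term.
        x_next = x - gamma * aggregate + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


if __name__ == "__main__":
    a, c = [1.0, 2.0, 3.0], [0.0, 1.0, 2.0]    # f_i(x) = 0.5 * a_i * (x - c_i)**2
    grads = [lambda x, ai=ai, ci=ci: ai * (x - ci) for ai, ci in zip(a, c)]
    print(abs(iag_momentum(grads, 0.0, gamma=0.005, beta=0.1, num_steps=4000) - 4.0 / 3.0))
```

Setting `beta=0.0` recovers the plain IAG iteration, which is a convenient sanity check when experimenting with the momentum parameter.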

5 Discussion 

We analyzed the IAG method when the sum of the component functions is strongly convex by viewing
it as a gradient descent method with errors. To the best of our knowledge, our analysis
provides the first explicit linear rate result. Furthermore, it is different from the existing
two approaches [5] and [23] in the sense that (i) it is based on simple basic inequalities that
make the global convergence analysis simpler, and (ii) it gives more insight into the behavior of IAG.
In particular, our analysis shows that the gradient errors can be treated as shocks with a
finite duration which can be bounded in terms of the distance of iterates to the optimal solution.
Therefore, by choosing the stepsize small enough and using the strong convexity properties,
we can guarantee that the distance to the optimal solution shrinks at each step by
a factor less than one.

We also developed a new algorithm, IAG with momentum, and provided a linear convergence
and rate analysis. It is expected that this algorithm can outperform IAG in problems
where the individual gradients show oscillatory behavior, because the momentum term provides
an extra (averaging) smoothing effect on the iterates.

We note that the extension of IAG to the generalized version of (1), $\min_{x \in \mathbb{R}^n} f(x) + h(x)$
with $h : \mathbb{R}^n \to \mathbb{R}$ convex and possibly non-smooth (such as the indicator function of a
constraint set when there are constraints), is simple by an additional (proximal) step, see [21]. Extending
our linear rate results to this case may be possible and is ongoing future work.
A Proof sketch of the global linear convergence of the
IAG-M method

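Using the gradient error bound (12) and the iterate update equation (27) of IAG-M,

\begin{align*}
\|e^k\| = \|g^k - \nabla f(x^k)\| &\le L \sum_{j=(k-K)_+}^{k-1} \|x^{j+1} - x^j\| = L \sum_{j=(k-K)_+}^{k-1} \|-\gamma g^j + \beta (x^j - x^{j-1})\| \\
&\le \gamma L \sum_{j=(k-K)_+}^{k-1} \|g^j\| + \beta L \sum_{j=(k-K)_+}^{k-1} \|x^j - x^{j-1}\| \\
&\le \gamma L \sum_{j=(k-K)_+}^{k-1} \big( \|\nabla f(x^j)\| + \|e^j\| \big) + \beta L \sum_{j=(k-K)_+}^{k-1} \|x^j - x^{j-1}\|.
\end{align*}

Then, using (7) to bound the norm of the gradient and (12) to bound $\|e^j\|$, this becomes

\begin{align}
\|e^k\| &\le \gamma L \sum_{j=(k-K)_+}^{k-1} \big( L\, \mathrm{dist}_j + \|e^j\| \big) + \beta L \sum_{j=(k-K)_+}^{k-1} \|x^j - x^{j-1}\| \notag \\
&\le \gamma L \sum_{j=(k-K)_+}^{k-1} \Big( L\, \mathrm{dist}_j + L \sum_{i=(j-K)_+}^{j-1} \|x^{i+1} - x^i\| \Big) + \beta L \sum_{j=(k-K)_+}^{k-1} \|x^j - x^{j-1}\| \notag \\
&\le \gamma L \sum_{j=(k-K)_+}^{k-1} \Big( L\, \mathrm{dist}_j + L \sum_{i=(j-K)_+}^{j-1} \big( \mathrm{dist}_i + \mathrm{dist}_{i+1} \big) \Big) + \beta L \sum_{j=(k-K)_+}^{k-1} \|x^j - x^{j-1}\| \notag \\
&\le \big( 3 \gamma L^2 K \big) \max_{(k-2K)_+ \le \ell \le k-1} \mathrm{dist}_\ell + \beta L \sum_{j=(k-K)_+}^{k-1} \|x^j - x^{j-1}\|. \tag{28}
\end{align}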


Note that when $\beta = 0$, this inequality reduces to (15) obtained for IAG. From the
update equation (27), we also have

\begin{align}
\|x^j - x^{j-1}\| &\le \gamma \|g^{j-1}\| + \beta \|x^{j-1} - x^{j-2}\| \notag \\
&\le \gamma \|g^{j-1}\| + \beta \big( \|x^{j-1} - x^*\| + \|x^{j-2} - x^*\| \big) \notag \\
&\le \gamma \|\nabla f(x^{j-1})\| + \gamma \|e^{j-1}\| + \beta \big( \mathrm{dist}_{j-1} + \mathrm{dist}_{j-2} \big) \notag \\
&\le \gamma L\, \mathrm{dist}_{j-1} + \gamma \|e^{j-1}\| + 2 \beta \max(\mathrm{dist}_{j-1}, \mathrm{dist}_{j-2}), \tag{29}
\end{align}

where we used (7) to bound the norm of the gradient in the last inequality. We next bound
the gradient error term $\|e^{j-1}\|$ on the right-hand side. A consequence of (11), the triangle
inequality and the boundedness of the gradient delays is that the gradient error is bounded by

\[ \|e^k\| \le \sum_{i=1}^m L_i \big( \mathrm{dist}_{\tau_i^k} + \mathrm{dist}_k \big) \le 2L \max_{(k-K)_+ \le \ell \le k} \mathrm{dist}_\ell, \qquad k \ge 0. \tag{30} \]
Combining the inequalities (28), (29) and (30) leads to

\[ \|e^k\| \le \Big( 3 \gamma L^2 K + \beta L K (3 \gamma L + 2 \beta) \Big) \max_{(k-2K-1)_+ \le \ell \le k-1} \mathrm{dist}_\ell, \tag{31} \]

which is an analogue of (15) for IAG-M. Then, following the same line of argument as in the
proof of Theorem 3.3, it is straightforward to show a linear rate for IAG-M as long as the
momentum parameter $\beta$ is not very large. More specifically, it follows that when $\beta \le b \gamma^p$
with $p \ge 1/2$ and a positive constant $b$, and $\gamma$ is small enough, IAG-M is globally linearly
convergent. The bounds on $\gamma$ and $b$ that guarantee linear convergence can also be derived
in a similar way to the proof of Theorem 3.3. We omit the details for the sake of
brevity.